My name is Sokona Mangane and I’m from Brooklyn, NY. I’m a senior at Bates College, majoring in Mathematics and minoring in Digital and Computational Studies. In conjunction with the Institute for a Racially Just, Inclusive, and Open STEM Education (RIOS Institute), I am conducting a computational text analysis of STEM Open Education Resources (OER). In particular, I’m analyzing the “Inclusive Teaching” section descriptions on the website CourseSource, an open-access, peer-reviewed journal that publishes lessons, teaching content, and resources related to biology and physics; in the words of Dr. Carrie Diaz Eaton (an Associate Professor of Digital and Computational Studies at Bates College and a co-founder of QUBES, known for her work in social justice in STEM higher education), it’s like a GitHub, but for curriculum.
When publishing an article on CourseSource (articles are categorized as a “Lesson”, “Science Behind the Lesson”, “Teaching Tools and Strategies”, “Essay”, or “Review”), authors can describe how the article is inclusive under the “Inclusive Teaching” section; however, there are currently no guidelines for this section. Thus, this text analysis of OER submissions serves to answer the question: what do people write about under Inclusive Teaching?
Here, I’ve imported the Excel dataset and the packages necessary for analysis. I also did some data cleaning, created a vector of DEI-related words, and added some variables to the original dataset.
#cmd + shift + c to comment out code
#cmd + shift + M to print %>% pipe operator
#cmd + return to run code
# install.packages("varhandle")
# install.packages("skimr")
# install.packages("tidyverse")
# install.packages("tidytext")
# install.packages("stopwords")
# install.packages("wordcloud")
# install.packages("reshape2")
# install.packages("ggraph")
# install.packages("kableExtra")
#loading necessary packages
library(varhandle)
library(ggraph)
library(igraph)
library(skimr)
library(tidyverse)
library(tidytext)
library(ggplot2)
library(readr)
library(stopwords)
library(wordcloud)
library(reshape2)
library(kableExtra)
library(SnowballC)
#importing dataset and DEI words list
rios_data <- read_csv("RIOS Research - Course Source - Sheet1 2.csv")
dei_keywords <- read_csv("SJEDI_words 2022-12-20 18_03_42.csv")
#updating human error for one article
rios_data$`Inclusive Teaching included?`[12] = "No"
#arranging years
rios_data <- rios_data %>%
arrange(desc(Year))
# creating a new column article number, to number each article (most recent article: 286)
rios_data$article_num <- c(nrow(rios_data):1)
# I've created a variable which contains diversity related words (words pulled from the keywords column) and then combined it with the `dei_keywords` dataframe I imported (Thank you Dr. Diaz-Eaton). I also added another column, which includes the article number for each row.
# diversity_related <- c("diversity", "bias", "confirmation bias", "cognitive bias", "social justice", "broader impacts", "racism", "identity", "equity", "inclusivity", "environmental justice", "inclusion", "belonging")
#
# #adding the vector above to the CSV dei_keywords
# for (x in 1:13){
# dei_keywords[nrow(dei_keywords) + 1,] = diversity_related[x]
# }
Here, each word from the Inclusive Teaching Description and keyword themes is “un-nested” into its own row, and any unnecessary punctuation, numbers, and words are removed. I saved each of these in their own new dataframes for analysis.
rios_data_tokenizedit <- rios_data %>%
unnest_tokens(output = inclusive_teach_tokens, input = `Inclusive Teaching Description`)
#removing all rows with any punctuation, digits, or "stopwords" (~20k rows total)
strings <- c("[:punct:]", "[:digit:]","\\(","\\)")
stopwords_vec <- stopwords(language = "en")
stopwords_vec <- stopwords_vec[-c(165:167)]
#removed ~777 rows
rios_data_tokenizedit <- rios_data_tokenizedit %>%
filter(!str_detect(inclusive_teach_tokens, paste(strings, collapse = "|")))
#removed ~19,663 rows
rios_data_tokenizedit <- rios_data_tokenizedit %>%
filter(!inclusive_teach_tokens %in% stopwords_vec)
#doing same thing as above but for keyword themes
# rios_data_tokenizedkt <- rios_data %>%
# unnest_tokens(output = keyword_themes_tokens, input = `keyword themes`)
#removing all rows with any punctuation, digits, or "stopwords" (78 rows total)
# strings <- c("[:punct:]", "[:digit:]","\\(","\\)")
# stopwords_vec <- stopwords(language = "en")
#removed ~777 rows
# rios_data_tokenizedkt <- rios_data_tokenizedkt %>%
# filter(!str_detect(keyword_themes_tokens, paste(strings, collapse = "|")))
#removed ~19,663 rows
# rios_data_tokenizedkt <- rios_data_tokenizedkt %>%
# filter(!keyword_themes_tokens %in% stopwords_vec)
Based on the work above, I exported the data frame of all the distinct ‘cleaned’ words in the Inclusive Teaching section to a CSV and manually verified whether each of the 4,464 words should be counted as JEDI (checking the context of the words as needed). After manual verification, I imported it back into R.
#allwords <- unique(rios_data_tokenizedit$inclusive_teach_tokens)
#uniquedeirelated <- sapply(allwords, function(x) any(sapply(dei_keywords, str_detect, string = x)))
#uniquedei <- cbind(allwords,uniquedeirelated)
#write_csv(as.data.frame(uniquedeirelated), "DEIRelated.csv")
#importing manually verified list of JEDI words
JEDI_keywords_df <- read_csv("cleanedITwords - cleanedITwords.csv")
JEDI_keywords <- JEDI_keywords_df %>%
filter(Carrie == "JEDI") %>%
select(1)
dei_related columns were created for each data frame; each entry is TRUE if that word (from the Inclusive Teaching Descriptions or the keyword themes) matches any word in the JEDI_keywords dataframe.
#creating a DEI related column
rios_data_tokenizedit$dei_relatedit = NA
# rios_data_tokenizedkt$dei_relatedkt = NA
rios_data_tokenizedit$dei_relatedit <- sapply(rios_data_tokenizedit$inclusive_teach_tokens, function(x) any(sapply(JEDI_keywords, str_detect, string = x)))
# rios_data_tokenizedkt$dei_relatedkt <- sapply(rios_data_tokenizedkt$keyword_themes_tokens, function(x) any(sapply(JEDI_keywords, str_detect, string = x)))
#save(dei_keywords, file = "dei_keywords.csv")
#saveRDS(rios_data_tokenized, file = "rios_data_tokenized.rds")
#removing the unnecessary columns
rios_data_tokenizedit <- rios_data_tokenizedit[,-c(9:13)]
# rios_data_tokenizedkt <- rios_data_tokenizedkt[,-c(9:13)]
The boxplot below visualizes the word count of the Inclusive Teaching section over time. The word count increases in 2019 compared to the years prior, and we also start to see more outliers. Overall, since the creation of CourseSource, the word count of the Inclusive Teaching section has increased.
#factoring years so I can reorder from oldest to most recent
rios_data$Year <- factor(rios_data$Year , levels=c("2022", "2021", "2020", "2019", "2018", "2017", "2016", "2015", "2014"))
#coloring groups (color blind safe colors) based on the shift (2014-2018 and 2019-2022)
boxplot(`Word Count of Inclusive Teaching?`~ Year,
data=rios_data,
main="Word Count of Inclusive Teaching Sections Over Time",
ylab="Year",
xlab="Word count of Inclusive Teaching Section",
horizontal = TRUE,
col = c("#1f78b4", "#1f78b4", "#1f78b4", "#005AB5", "#DC3220","#DC3220", "#DC3220", "#DC3220", "#DC3220"))
abline(v = 118.4868, col = "#DC3220", lty = "solid", lwd = 3)
abline(v = 216.0667, col = "#005AB5",lty = "solid", lwd = 3)
legend("topright", inset=.02, title="Average Word Count",
c("for 2014 - 2018","for 2019 - 2022"), fill=c("#DC3220", "#005AB5"), horiz=TRUE, cex=0.8)
rios_data$Year <- unfactor(rios_data$Year)
Presented below is an in-depth look at what’s visualized above.
rios_data %>%
group_by(Year) %>%
skim(starts_with("Word Count")) %>%
select(3,4,6:13) %>%
mutate(numeric.mean = round(numeric.mean, digits = 2), numeric.sd = round(numeric.sd, digits = 2), variance = (numeric.sd)^2) %>%
rename("Mean" = "numeric.mean",
"SD" = "numeric.sd",
"Variance" = "variance",
"Min" = "numeric.p0",
"25 Q" = "numeric.p25",
"Median" = "numeric.p50",
"75 Q" = "numeric.p75",
"Max" = "numeric.p100",
"Histogram" = "numeric.hist") %>%
kable() %>%
kable_minimal()
| Year | n_missing | Mean | SD | Min | 25 Q | Median | 75 Q | Max | Histogram | Variance |
|---|---|---|---|---|---|---|---|---|---|---|
| 2014 | 4 | 106.85 | 58.05 | 34 | 63.00 | 90.0 | 133.00 | 230 | ▇▇▆▁▃ | 3369.802 |
| 2015 | 4 | 122.57 | 61.70 | 43 | 70.50 | 116.0 | 174.25 | 228 | ▇▅▃▂▅ | 3806.890 |
| 2016 | 3 | 115.80 | 89.83 | 26 | 79.00 | 103.0 | 127.50 | 453 | ▇▅▁▁▁ | 8069.429 |
| 2017 | 2 | 123.00 | 56.55 | 37 | 83.50 | 107.0 | 154.00 | 238 | ▃▇▃▃▂ | 3197.903 |
| 2018 | 5 | 124.70 | 80.22 | 34 | 89.75 | 95.0 | 144.75 | 324 | ▆▇▃▁▂ | 6435.248 |
| 2019 | 4 | 173.33 | 114.77 | 25 | 98.75 | 141.5 | 213.75 | 483 | ▆▇▃▁▂ | 13172.153 |
| 2020 | 2 | 249.45 | 218.46 | 43 | 126.50 | 203.0 | 276.00 | 1415 | ▇▂▁▁▁ | 47724.772 |
| 2021 | 5 | 224.62 | 170.73 | 43 | 125.75 | 169.0 | 241.75 | 901 | ▇▃▁▁▁ | 29148.733 |
| 2022 | 1 | 210.74 | 118.28 | 41 | 124.00 | 183.0 | 249.50 | 565 | ▅▇▂▁▁ | 13990.158 |
#CREATING A BAR PLOT OF THE AVERAGE WORD COUNT OF THE TWO GROUPS
rios_data <- rios_data %>%
mutate(`Group Year` = case_when(
as.numeric(Year) >= 5 ~ "2014 - 2018",
as.numeric(Year) < 5 ~ "2019 - 2022"
)) #creating column that groups based on the shift (2014-2018 and 2019-2022), used 5 because it's factored and 2018 = level 5
# test <- t.test(formula = `Word Count of Inclusive Teaching?` ~ `Group Year`,
# data = rios_data)
#
# test #p-value = 1.557e-10, confidence interval: -126.37537 -68.78427, df = 254?!
rios_data %>%
group_by(`Group Year`) %>%
summarise(Avg_Wrd_Ct = mean(`Word Count of Inclusive Teaching?`, na.rm = TRUE), #calculating avg wrd count for each group
n = n(),
sd = sd(`Word Count of Inclusive Teaching?`, na.rm = TRUE)) %>% #calculating the standard deviation for each group (for confidence intervals)
mutate(se = sd/sqrt(n), #calculating the standard error
ic = se * qt((1-0.05)/2 + 0.5, n-1)) %>% #multiplying the standard error by the critical t value for a 95% confidence interval
ggplot(aes(`Group Year`, Avg_Wrd_Ct, fill = `Group Year`)) +
geom_col() +
labs(title = "Average Word Count of Inclusive Teaching Section", subtitle = "Before and After 2018", x = "Year", y = "Average Word Count") +
scale_fill_manual(values = c("#005AB5", "#DC3220")) + #manually coloring groups (color blind safe colors)
geom_errorbar(aes(x = `Group Year`, ymin = Avg_Wrd_Ct - ic, ymax = Avg_Wrd_Ct + ic), width = 0.4)
Only 2.6% of the words in the Inclusive Teaching text are DEI related (118/4,464). Looking at the most common DEI words gives us an idea of which DEI words are used the most, and what that tells us about how the authors are being inclusive. According to the table below, the words “inclusive”, “diversity”, and “diverse” are the most common “DEI” words. This makes sense, as inclusive teaching should be diverse and cater to a diversity of racial backgrounds. Out of the 118 “DEI” words used in the Inclusive Teaching text, note that 70% are repeated more than once (83/118) and 54.2% are repeated more than twice (64/118). Based on these common words, it seems these articles try to be inclusive by being diverse, engaging, and catering to a diverse set of backgrounds and abilities.
However, because these descriptions already sit under a section titled “Inclusive Teaching”, an author could write a lengthy description without using any of the words from dei_keywords and still be included in this category; thus, these numbers may be an underestimate. Below (in the 2-gram analysis) you can find a data frame, rios_2w_count, that shows the most common two-word DEI phrases.
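The repetition percentages quoted above can be double-checked directly from the token counts. This is a quick sketch (not part of the original pipeline) using the rios_data_tokenizedit dataframe built earlier:

```
#proportion of distinct DEI words appearing more than once / more than twice
rios_data_tokenizedit %>%
  filter(dei_relatedit == "TRUE") %>%
  count(inclusive_teach_tokens) %>%          #one row per distinct DEI word
  summarise(distinct_words = n(),
            repeated_once  = mean(n > 1),    #share repeated more than once
            repeated_twice = mean(n > 2))    #share repeated more than twice
```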
#BSS = BASIC SUMMARY STATISTICS
#most common DEI words, out of 118 (out of 4,464 words, 2.6% are DEI)
rios_data_tokenizedit %>%
filter(dei_relatedit == "TRUE") %>%
count(inclusive_teach_tokens, sort = TRUE)
## # A tibble: 785 × 2
## inclusive_teach_tokens n
## <chr> <int>
## 1 students 1389
## 2 inclusive 123
## 3 diversity 115
## 4 diverse 100
## 5 opportunity 94
## 6 individual 72
## 7 engage 66
## 8 environment 64
## 9 backgrounds 57
## 10 participate 57
## # ℹ 775 more rows
# #creating stemmed DEI word list
# DEI_related <- rios_data_tokenizedit %>%
# filter(dei_relatedit == "TRUE") %>%
# select(10) #CREATING DF OF DEI INCLUSIVE TEACH TOKENS
#creating column of stemmed tokens
rios_data_tokenizedit <- rios_data_tokenizedit %>%
mutate(stem = wordStem(rios_data_tokenizedit$inclusive_teach_tokens, language = "en")) %>%
rename("inclusive_tokens_stem" = "stem")
#creating stemmed JEDI list
stem_jedi <- JEDI_keywords %>%
mutate(stem = wordStem(`allwords`, language = "en")) %>%
count(stem, sort = TRUE)
#stem_token_it$dei_related <- sapply(stem_token_it$stem, function(x) any(sapply(JEDI_keywords, str_detect, string = x)))
rios_data_tokenizedit %>%
filter(dei_relatedit == "TRUE") %>%
count(inclusive_tokens_stem, sort = TRUE)
## # A tibble: 497 × 2
## inclusive_tokens_stem n
## <chr> <int>
## 1 student 1389
## 2 divers 215
## 3 inclus 181
## 4 particip 167
## 5 engag 165
## 6 opportun 143
## 7 encourag 142
## 8 individu 133
## 9 addit 87
## 10 access 85
## # ℹ 487 more rows
Word clouds are another way of visualizing which words are being used the most. This word cloud shows the distinct words printed in the table above.
#ORIGINAL (unstemmed) word cloud of the table printed above
rios_data_tokenizedit %>%
filter(dei_relatedit == "TRUE") %>%
count(inclusive_teach_tokens, sort = TRUE) %>%
with(wordcloud(inclusive_teach_tokens, n, min.freq = 2))
#STEMMED AND LOGGED WORD CLOUD
rios_data_tokenizedit %>%
filter(dei_relatedit == "TRUE") %>%
count(inclusive_tokens_stem, sort = TRUE) %>%
mutate(log_n = log(n)) %>% #computing a log-scaled count since the distribution is very skewed (note: the cloud below still plots the raw counts)
with(wordcloud(inclusive_tokens_stem, n, min.freq = 2)) #words w/ frequency below 2 won't be plotted; this eliminates about 33.2% of the data
The tables and visuals above give us an idea of how often DEI words are used and what that says about the inclusivity of the articles. However, looking at the most commonly used DEI words doesn’t give us all the information on how an article is being inclusive or how it defines inclusivity. Thus, I’ll repeat the analyses above, but looking at phrases, specifically two-word phrases. Unlike above, I looked through all the phrases and removed any that I felt were unnecessary and/or didn’t make sense. Below, the most common DEI phrases are printed. Although some of the phrases aren’t repeated often, they show that the definition of inclusive teaching goes beyond just engaging all students.
rios_data_token2it <- rios_data %>%
unnest_tokens(it_tokens_2w, `Inclusive Teaching Description`, token = "ngrams", n = 2) %>%
separate(it_tokens_2w, c("word1", "word2"), sep = " ")
rios_data_token2it <- rios_data_token2it %>%
filter(!word1 %in% stopwords_vec) %>%
filter(!word2 %in% stopwords_vec) %>%
unite(it_tokens_2w, word1, word2, sep = " ")
rios_data_token2it <- rios_data_token2it %>%
filter(!str_detect(it_tokens_2w, paste(strings, collapse = "|")))
# code to print most common DEI 2word phrases for review
# #for review (Naz?)
# all2words <- as.data.frame(unique(rios_data_token2it$it_tokens_2w))
# all2words$dei_related = NA
# all2words$dei_related <- sapply(all2words$`unique(rios_data_token2it$it_tokens_2w)`, function(x) any(sapply(JEDI_keywords, str_detect, string = x)))
# write_csv(all2words, "2DEIRelated.csv")
#
#
# #most common DEI words
# rios_2w_count <- rios_data_token2it %>%
# filter(dei_related == "TRUE") %>%
# count(it_tokens_2w, sort = TRUE)
#
# write_csv(rios_2w_count, "rios2wcount.csv")
#creating a DEI related column
rios_data_token2it$dei_related = NA
#importing manually verified list of JEDI 2 word phrases
JEDI_2keywords_df <- read_csv("cleanedIT2words.csv")
#filtering words that haven't been checked and aren't JEDI
JEDI_2keywords <- JEDI_2keywords_df %>%
filter(JEDI_2keywords_df$...5 != "unsure" | is.na(JEDI_2keywords_df$...5)) %>%
filter(...3 == "JEDI") %>%
select(1)
rios_data_token2it$dei_related <- sapply(rios_data_token2it$it_tokens_2w, function(x) any(sapply(JEDI_2keywords, str_detect, string = x)))
#most common DEI words
rios_2w_count <- rios_data_token2it %>%
filter(dei_related == "TRUE") %>%
count(it_tokens_2w, sort = TRUE)
#bar graph of the most common DEI two-word phrases
rios_2w_count %>%
top_n(30) %>%
mutate(it_tokens_2w = reorder(it_tokens_2w, n)) %>%
ggplot(aes(it_tokens_2w, n)) +
geom_col() +
coord_flip() +
labs(y = "(DEI Related) 2 Word Count in Inclusive Teaching Text") +
xlab(NULL)
I also did a 3-gram analysis, which has much lower frequencies. However, this gives us a better idea of what “inclusive teaching” means in these contexts.
## plot the frequency
rios_data_token3it <- rios_data %>%
unnest_tokens(it_tokens_3w, `Inclusive Teaching Description`, token = "ngrams", n = 3) %>%
separate(it_tokens_3w, c("word1", "word2", "word3"), sep = " ")
rios_data_token3it <- rios_data_token3it %>%
filter(!word1 %in% stopwords_vec) %>%
filter(!word2 %in% stopwords_vec) %>%
filter(!word3 %in% stopwords_vec) %>%
unite(it_tokens_3w, word1, word2, word3, sep = " ")
rios_data_token3it <- rios_data_token3it %>%
filter(!str_detect(it_tokens_3w, paste(strings, collapse = "|")))
# code to print most common DEI 3word phrases for review
# all3words <- as.data.frame(unique(rios_data_token3it$it_tokens_3w))
# all3words$dei_related = NA
# all3words$dei_related <- sapply(all3words$`unique(rios_data_token3it$it_tokens_3w)`, function(x) any(sapply(JEDI_2keywords, str_detect, string = x)))
# #write_csv(all2words, "2DEIRelated.csv")
#
#
# #most common DEI words
# all3words <- all3words %>%
# filter(dei_related == "TRUE") %>%
# count(`unique(rios_data_token3it$it_tokens_3w)`, sort = TRUE)
#
# write_csv(all3words, "all3words.csv")
#creating a DEI related column
rios_data_token3it$dei_related = NA
rios_data_token3it$dei_related <- sapply(rios_data_token3it$it_tokens_3w, function(x) any(sapply(dei_keywords, str_detect, string = x)))
#removing the unnecessary columns/rows
rios_data_token3it <- rios_data_token3it[,-c(9:13)]
#for review (Naz?)
# all3words <- as.data.frame(unique(rios_data_token3it$it_tokens_3w))
# all3words$dei_related = NA
# all3words$dei_related <- sapply(all3words$`unique(rios_data_token3it$it_tokens_3w)`, function(x) any(sapply(JEDI_keywords, str_detect, string = x)))
# write_csv(all3words, "3DEIRelated.csv")
#graph of the most common
rios_data_token3it %>%
filter(dei_related == "TRUE") %>%
count(it_tokens_3w, sort = TRUE) %>%
top_n(30) %>%
mutate(it_tokens_3w = reorder(it_tokens_3w, n)) %>%
ggplot(aes(it_tokens_3w, n)) +
geom_col() +
coord_flip() +
labs(y = "(DEI Related) 3 Word Count in Inclusive Teaching Text") +
xlab(NULL)
This is a bar chart that looks more in depth at the frequency of words used each year. You can click on each tab to compare. Although 2014–2016 have higher proportions of DEI words, from 2018–2022 the top 20 DEI-related words are used more frequently; specifically, the words “diverse”, “diversity”, “engage”, “individual”, and “inclusive” are used more. We can see that the number and diversity of words increases each year (with decreases in 2018 and 2021).
# Below is the code to plot the word counts for all 9 years
#saving for visuals on word counts, etc
it_word_counts <- rios_data_tokenizedit %>%
filter(dei_relatedit == "TRUE") %>%
group_by(Year) %>%
count(inclusive_teach_tokens, sort = TRUE)
it_stem_word_counts <- rios_data_tokenizedit %>%
filter(dei_relatedit == "TRUE") %>%
group_by(Year) %>%
count(inclusive_tokens_stem, sort = TRUE)
#original
it_word_counts %>%
# filter(Year != 2018) %>%
filter(n > 1) %>% # 41.58% of the data
ggplot(aes(inclusive_teach_tokens, n)) +
geom_col() +
#geom_text(aes(label = inclusive_teach_tokens), vjust = -0.5, size = 1, nudge_y = 1) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
facet_wrap(~Year, ncol = 2) +
ylim(0,355) +
labs(title = "(DEI Related) Word Frequency Over Time", x = "(DEI Related) Word", y = "Word Count in Inclusive Teaching Text") +
## reduce spacing between labels and bars
scale_x_discrete(expand = c(.01, .01)) +
scale_fill_identity(guide = "none") +
## get rid of all elements except y axis labels + adjust plot margin +
theme(axis.text.y = element_text(size = 14, hjust = 1, family = "Fira Sans"),
plot.margin = margin(rep(15, 4)))
it_stem_word_counts %>%
# filter(Year != 2018) %>%
filter(n > 1) %>% # 41.58% of the data
ggplot(aes(inclusive_tokens_stem, n)) +
geom_col() +
#geom_text(aes(label = inclusive_teach_tokens), vjust = -0.5, size = 1, nudge_y = 1) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
facet_wrap(~Year, ncol = 2) +
ylim(0,355) +
labs(title = "(DEI Related) Stemmed Word Frequency Over Time", x = "(DEI Related) Word", y = "Word Count in Inclusive Teaching Text") +
## reduce spacing between labels and bars
scale_x_discrete(expand = c(.01, .01)) +
scale_fill_identity(guide = "none") +
## get rid of all elements except y axis labels + adjust plot margin +
theme(axis.text.y = element_text(size = 7, hjust = 1, family = "Fira Sans"),
plot.margin = margin(rep(15, 4)))
# it_word_counts %>%
# ungroup(Year) %>%
# count(n) %>%
# mutate(percent = (nn/416)*100)
# it_word_counts %>%
# filter(inclusive_teach_tokens != "students") %>%
# group_by(Year) %>%
# summarise(max = max(n))
#
# it_stem_word_counts %>%
# filter(inclusive_tokens_stem != "student") %>%
# group_by(Year) %>%
# summarise(max = max(n))
it_word_counts %>%
filter(Year == 2014) %>%
head(20) %>%
ggplot(aes(inclusive_teach_tokens, n)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ylim(0,35)
it_word_counts %>%
filter(Year == 2015) %>%
head(20) %>%
ggplot(aes(inclusive_teach_tokens, n)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ylim(0,35)
it_word_counts %>%
filter(Year == 2016) %>%
head(20) %>%
ggplot(aes(inclusive_teach_tokens, n)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ylim(0,35)
it_word_counts %>%
filter(Year == 2017) %>%
head(20) %>%
ggplot(aes(inclusive_teach_tokens, n)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ylim(0,35)
it_word_counts %>%
filter(Year == 2018) %>%
head(20) %>%
ggplot(aes(inclusive_teach_tokens, n)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ylim(0,35)
it_word_counts %>%
filter(Year == 2019) %>%
head(20) %>%
ggplot(aes(inclusive_teach_tokens, n)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ylim(0,35)
it_word_counts %>%
filter(Year == 2020) %>%
head(20) %>%
ggplot(aes(inclusive_teach_tokens, n)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ylim(0,35)
it_word_counts %>%
filter(Year == 2021) %>%
head(20) %>%
ggplot(aes(inclusive_teach_tokens, n)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ylim(0,35)
it_word_counts %>%
filter(Year == 2022) %>%
head(20) %>%
ggplot(aes(inclusive_teach_tokens, n)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ylim(0,35)
it_stem_word_counts %>%
filter(Year == 2014) %>%
head(20) %>%
ggplot(aes(inclusive_tokens_stem, n)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ylim(0,60)
it_stem_word_counts %>%
filter(Year == 2015) %>%
head(20) %>%
ggplot(aes(inclusive_tokens_stem, n)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ylim(0,60)
it_stem_word_counts %>%
filter(Year == 2016) %>%
head(20) %>%
ggplot(aes(inclusive_tokens_stem, n)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ylim(0,60)
it_stem_word_counts %>%
filter(Year == 2017) %>%
head(20) %>%
ggplot(aes(inclusive_tokens_stem, n)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ylim(0,60)
it_stem_word_counts %>%
filter(Year == 2018) %>%
head(20) %>%
ggplot(aes(inclusive_tokens_stem, n)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ylim(0,60)
it_stem_word_counts %>%
filter(Year == 2019) %>%
head(20) %>%
ggplot(aes(inclusive_tokens_stem, n)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ylim(0,60)
it_stem_word_counts %>%
filter(Year == 2020) %>%
head(20) %>%
ggplot(aes(inclusive_tokens_stem, n)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ylim(0,60)
it_stem_word_counts %>%
filter(Year == 2021) %>%
head(20) %>%
ggplot(aes(inclusive_tokens_stem, n)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ylim(0,60)
it_stem_word_counts %>%
filter(Year == 2022) %>%
head(20) %>%
ggplot(aes(inclusive_tokens_stem, n)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ylim(0,60)
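The nine per-year chunks above differ only in the year being filtered and the y-axis limit; a small helper function (a sketch, not part of the original pipeline) could produce the same bar chart for any year and cut down on repetition:

```
#sketch: one function to draw the top-20 bar chart for a given year
plot_year_counts <- function(counts, token_col, year, y_max) {
  counts %>%
    filter(Year == year) %>%
    head(20) %>%
    ggplot(aes({{ token_col }}, n)) +           #tidy-eval embracing for the column name
    geom_col(show.legend = FALSE) +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    ylim(0, y_max)
}

#e.g. plot_year_counts(it_word_counts, inclusive_teach_tokens, 2020, 35)
#or, for the stemmed counts: plot_year_counts(it_stem_word_counts, inclusive_tokens_stem, 2020, 60)
```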
This can help us see the “weight” of each word and which words are most distinctive for each year. Here we compute the tf-idf statistic, which builds on the inverse document frequency (idf): the idf “decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents. This can be combined with term frequency to calculate a term’s tf-idf [the term frequency and idf multiplied together], the frequency of a term adjusted for how rarely it is used. The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites” (Silge & Robinson, Text Mining with R).
For comparison purposes, the y-axis has the same limits for all of the graphs. We can see that the words “cultured” and “disengaged” are the most distinctive 2014 words compared to the other DEI-related words across the rest of the years. These results make sense and align with the visuals above: because these words are rarely used outside of 2014 (unlike “diversity” and “diverse”, which were seen more than 5 times in 2014), they carry a high tf-idf weight for 2014. These words have been used more each year.
#finding the most distinctive words for each document
it_word_counts %>%
bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
arrange(desc(tf_idf)) %>%
filter(Year == 2014) %>%
top_n(40) %>%
mutate(it_tokens_3w = reorder(inclusive_teach_tokens,tf_idf)) %>%
ggplot(aes(inclusive_teach_tokens, tf_idf)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(title = "(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in 2014", x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") + ylim(0,0.04)
it_word_counts %>%
bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
arrange(desc(tf_idf)) %>%
filter(Year == 2015) %>%
top_n(40) %>%
mutate(it_tokens_3w = reorder(inclusive_teach_tokens,tf_idf)) %>%
ggplot(aes(inclusive_teach_tokens, tf_idf)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(title = "(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in 2015", x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") + ylim(0,0.04)
it_word_counts %>%
bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
arrange(desc(tf_idf)) %>%
filter(Year == 2016) %>%
top_n(40) %>%
mutate(it_tokens_3w = reorder(inclusive_teach_tokens,tf_idf)) %>%
ggplot(aes(inclusive_teach_tokens, tf_idf)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(title = "(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in 2016", x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") + ylim(0,0.04)
it_word_counts %>%
bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
arrange(desc(tf_idf)) %>%
filter(Year == 2017) %>%
top_n(40) %>%
mutate(it_tokens_3w = reorder(inclusive_teach_tokens,tf_idf)) %>%
ggplot(aes(inclusive_teach_tokens, tf_idf)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(title = "(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in 2017", x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") + ylim(0,0.04)
it_word_counts %>%
bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
arrange(desc(tf_idf)) %>%
filter(Year == 2018) %>%
top_n(40) %>%
mutate(it_tokens_3w = reorder(inclusive_teach_tokens,tf_idf)) %>%
ggplot(aes(inclusive_teach_tokens, tf_idf)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(title = "(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in 2018", x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") + ylim(0,0.04)
it_word_counts %>%
bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
arrange(desc(tf_idf)) %>%
filter(Year == 2019) %>%
top_n(40) %>%
mutate(it_tokens_3w = reorder(inclusive_teach_tokens,tf_idf)) %>%
ggplot(aes(inclusive_teach_tokens, tf_idf)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(title = "(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in 2019", x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") + ylim(0,0.04)
it_word_counts %>%
bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
arrange(desc(tf_idf)) %>%
filter(Year == 2020) %>%
top_n(40) %>%
mutate(it_tokens_3w = reorder(inclusive_teach_tokens,tf_idf)) %>%
ggplot(aes(inclusive_teach_tokens, tf_idf)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(title = "(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in 2020", x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") + ylim(0,0.04)
it_word_counts %>%
bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
arrange(desc(tf_idf)) %>%
filter(Year == 2021) %>%
top_n(40) %>%
mutate(it_tokens_3w = reorder(inclusive_teach_tokens,tf_idf)) %>%
ggplot(aes(inclusive_teach_tokens, tf_idf)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(title = "(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in 2021", x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") + ylim(0,0.04)
it_word_counts %>%
bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
arrange(desc(tf_idf)) %>%
filter(Year == 2022) %>%
top_n(40) %>%
mutate(it_tokens_3w = reorder(inclusive_teach_tokens,tf_idf)) %>%
ggplot(aes(inclusive_teach_tokens, tf_idf)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(title = "(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in 2022", x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") + ylim(0,0.04)
To get a deeper understanding of how inclusive teaching is viewed, we will create a network plot to look at the relationships between words/phrases in the Inclusive Teaching section. The generated igraph graph is called rios_phrase_network; it has 41 words and 36 connections among them. Similar to what some of the graphs above have portrayed, the words “inclusive”, “students”, and “diverse” are connected to many other words.
rios_data_token2 <- rios_data_token2it %>%
separate(it_tokens_2w, c("word1", "word2"), sep = " ")
rios_phrase_network <- rios_data_token2 %>%
filter(dei_related == TRUE & Year == 2014) %>%
count(word1, word2, sort = TRUE) %>%
graph_from_data_frame()
set.seed(20181005)
a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")
ggraph(rios_phrase_network, layout = "fr") + geom_edge_link(aes(color = n, width = n), arrow = a) +
geom_node_point() + geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
labs(title = "Network Plot of (DEI Related) Word Relationship in 2014")
rios_data_token2 <- rios_data_token2it %>%
separate(it_tokens_2w, c("word1", "word2"), sep = " ")
rios_phrase_network <- rios_data_token2 %>%
filter(dei_related == TRUE & Year == 2015) %>%
count(word1, word2, sort = TRUE) %>%
graph_from_data_frame()
set.seed(20181005)
a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")
ggraph(rios_phrase_network, layout = "fr") + geom_edge_link(aes(color = n, width = n), arrow = a) +
geom_node_point() + geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
labs(title = "Network Plot of (DEI Related) Word Relationship in 2015")
rios_data_token2 <- rios_data_token2it %>%
separate(it_tokens_2w, c("word1", "word2"), sep = " ")
rios_phrase_network <- rios_data_token2 %>%
filter(dei_related == TRUE & Year == 2016) %>%
count(word1, word2, sort = TRUE) %>%
graph_from_data_frame()
set.seed(20181005)
a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")
ggraph(rios_phrase_network, layout = "fr") + geom_edge_link(aes(color = n, width = n), arrow = a) +
geom_node_point() + geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
labs(title = "Network Plot of (DEI Related) Word Relationship in 2016")
rios_data_token2 <- rios_data_token2it %>%
separate(it_tokens_2w, c("word1", "word2"), sep = " ")
rios_phrase_network <- rios_data_token2 %>%
filter(dei_related == TRUE & Year == 2017) %>%
count(word1, word2, sort = TRUE) %>%
graph_from_data_frame()
set.seed(20181005)
a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")
ggraph(rios_phrase_network, layout = "fr") + geom_edge_link(aes(color = n, width = n), arrow = a) +
geom_node_point() + geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
labs(title = "Network Plot of (DEI Related) Word Relationship in 2017")
rios_data_token2 <- rios_data_token2it %>%
separate(it_tokens_2w, c("word1", "word2"), sep = " ")
rios_phrase_network <- rios_data_token2 %>%
filter(dei_related == TRUE & Year == 2018) %>%
count(word1, word2, sort = TRUE) %>%
graph_from_data_frame()
set.seed(20181005)
a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")
ggraph(rios_phrase_network, layout = "fr") + geom_edge_link(aes(color = n, width = n), arrow = a) +
geom_node_point() + geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
labs(title = "Network Plot of (DEI Related) Word Relationship in 2018")
rios_data_token2 <- rios_data_token2it %>%
separate(it_tokens_2w, c("word1", "word2"), sep = " ")
rios_phrase_network <- rios_data_token2 %>%
filter(dei_related == TRUE & Year == 2019) %>%
count(word1, word2, sort = TRUE) %>%
graph_from_data_frame()
set.seed(20181005)
a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")
ggraph(rios_phrase_network, layout = "fr") + geom_edge_link(aes(color = n, width = n), arrow = a) +
geom_node_point() + geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
labs(title = "Network Plot of (DEI Related) Word Relationship in 2019")
rios_data_token2 <- rios_data_token2it %>%
separate(it_tokens_2w, c("word1", "word2"), sep = " ")
rios_phrase_network <- rios_data_token2 %>%
filter(dei_related == TRUE & Year == 2020) %>%
count(word1, word2, sort = TRUE) %>%
graph_from_data_frame()
set.seed(20181005)
a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")
ggraph(rios_phrase_network, layout = "fr") + geom_edge_link(aes(color = n, width = n), arrow = a) +
geom_node_point() + geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
labs(title = "Network Plot of (DEI Related) Word Relationship in 2020")
rios_data_token2 <- rios_data_token2it %>%
separate(it_tokens_2w, c("word1", "word2"), sep = " ")
rios_phrase_network <- rios_data_token2 %>%
filter(dei_related == TRUE & Year == 2021) %>%
count(word1, word2, sort = TRUE) %>%
graph_from_data_frame()
set.seed(20181005)
a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")
ggraph(rios_phrase_network, layout = "fr") + geom_edge_link(aes(color = n, width = n), arrow = a) +
geom_node_point() + geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
labs(title = "Network Plot of (DEI Related) Word Relationship in 2021")
rios_data_token2 <- rios_data_token2it %>%
separate(it_tokens_2w, c("word1", "word2"), sep = " ")
rios_phrase_network <- rios_data_token2 %>%
filter(dei_related == TRUE & Year == 2022) %>%
count(word1, word2, sort = TRUE) %>%
graph_from_data_frame()
set.seed(20181005)
a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")
ggraph(rios_phrase_network, layout = "fr") + geom_edge_link(aes(color = n, width = n), arrow = a) +
geom_node_point() + geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
labs(title = "Network Plot of (DEI Related) Word Relationship in 2022")
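Like the tf-idf charts, the nine network blocks above repeat the same steps for each year, so they too could be generated by a single helper. The sketch below collapses the pattern into one function; `plot_dei_network` is a hypothetical helper name, and it assumes `rios_data_token2it` has the `it_tokens_2w` bigram, `dei_related`, and `Year` columns used above:

```r
# Sketch of a helper to build and draw the per-year DEI bigram network.
# Assumes rios_data_token2it has columns it_tokens_2w, dei_related, and Year.
plot_dei_network <- function(tokens, year) {
  network <- tokens %>%
    separate(it_tokens_2w, c("word1", "word2"), sep = " ") %>%
    filter(dei_related == TRUE & Year == year) %>%
    count(word1, word2, sort = TRUE) %>%
    graph_from_data_frame()

  # fixed seed so the force-directed ("fr") layout is reproducible
  set.seed(20181005)
  a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")

  ggraph(network, layout = "fr") +
    geom_edge_link(aes(color = n, width = n), arrow = a) +
    geom_node_point() +
    geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
    labs(title = paste0("Network Plot of (DEI Related) Word Relationship in ", year))
}

# One call per year, e.g.:
plot_dei_network(rios_data_token2it, 2014)
```

Because the filter, seed, and title all come from one place, every year's plot stays directly comparable and gets a title without copy-paste.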
CourseSource. QUBES. (n.d.). Retrieved October 2022, from https://qubeshub.org/community/groups/coursesource/
Dewsbury, B., & Brame, C. J. (2019). Inclusive teaching. CBE—Life Sciences Education, 18(2), 1–5. https://doi.org/10.1187/cbe.19-01-0021